CINDI: A Virtual Library Indexing and Discovery System

Authors

  • Bipin C. Desai
  • Rajjan Shinghal
  • Nader Shayan
  • Youquan Zhou
Abstract

This article describes a system called CINDI for cataloging and searching documents in a distributed virtual library. When putting a document in the library, the author provides and registers metadata in the form of a semantic header for the document. The semantic header contains information on both the syntactic and semantic content of the document. An expert system simulating the expertise of a cataloging librarian helps the provider fill in the semantic header according to accepted library practice. Later, when someone searches for documents in the library, another component of the expert system helps the searcher formulate the query properly. This component simulates the expertise of a reference librarian. The system then uses the information provided by the semantic headers to locate and access the documents wanted by the searcher.

Bipin C. Desai, Rajjan Shinghal, Nader R. Shayan, and Youquan Zhou, Department of Computer Science, Concordia University, 1455 de Maisonneuve Blvd. West, Montreal, CANADA H3G 1M8
LIBRARY TRENDS, Vol. 48, No. 1, Summer 1999, pp. 209-233
© 1999 The Board of Trustees, University of Illinois

INTRODUCTION

A virtual library is a collection of electronic documents and resources distributed across a computer communication network (Saunders, 1993). These documents must be cataloged adequately so that a future interested reader (searcher) can find and access them with relative ease. Many systems (Kahle, 1991; Pinkerton, 1994; Mauldin, 1995; Welcome, 1995) catalog a document on the basis of words selected from it. They do not use the document’s semantic contents but generally use a program (called a robot, worm, spider, or crawler [Web robots, 1996]) which traverses the network, accessing the documents to be cataloged. An efficient cataloging system calls for a precise description of the semantic contents of documents.
A number of systems have addressed the problem of cataloging, among which CORE (Cromwell, 1994), MARC (Byrne, 1991; Crawford, 1984; Petersen & Molholt, 1990), MLC (Horny, 1985; Ross & West, 1985; Rhee, 1985), and TEI (Gaynor, 1994; Giordano, 1994) can be mentioned. These systems, however, are mainly designed for professional catalogers.

Creating indexes based on search robots has the following disadvantages: repeated attempts by robots to find new resources increase the traffic on the network; the number of these robots is increasing, and system administrators are likely to disallow visits by robots; and a robot-based approach would become difficult to justify if the network switched to a fee-for-use mode of operation (Brody, 1995; Brownlee, 1995; Cocchi, Estrin, Shenker, & Zheng, 1991; MacKie-Mason, 1997). Searching with the more recent indexing systems (AltaVista, InfoSeek, Lycos, Yahoo) is cumbersome since the number of hits can be prohibitive due to the poor selectivity of the supported search terms (Desai, 1997a).

Metadata should be designed to describe the semantic content of an information resource and should be better suited to supporting its subsequent discovery than the resource itself. In many cases, the resource by its nature cannot provide its own semantic contents, or it can do so only after a fairly extensive and time-consuming computation. Examples of such resources are audio, video, and collections of program code. Our metadata takes the form of a semantic header (SH) (Desai, 1994a). Details of the SH and its comparison with the Dublin Metadata Element List (DMEL) are described by Desai (1997). The use of the DMEL in representing Web objects is given by Qin (1998). When an author puts a document on the net, she is the one who knows the document well and can describe it best semantically. Accordingly, she fills in the slots in the semantic header.
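The idea of a provider filling in the slots of a semantic header can be sketched as a small record type. This is only an illustration under stated assumptions: the field names (`title`, `authors`, `subjects`, `abstract`, `location`, `annotations`) and the choice of which slots count as required are hypothetical, and the actual element set of the semantic header is the one defined in the cited work, not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SemanticHeader:
    """Hypothetical, simplified semantic header; all field names are illustrative."""
    title: str = ""
    authors: List[str] = field(default_factory=list)
    subjects: List[str] = field(default_factory=list)      # semantic content
    abstract: str = ""
    location: str = ""                                     # where the document can be accessed
    annotations: List[str] = field(default_factory=list)   # optional reviewer comments

    def missing_slots(self) -> List[str]:
        """Return the names of required slots the provider has not filled in yet."""
        required = ("title", "authors", "subjects", "abstract", "location")
        return [name for name in required if not getattr(self, name)]


# A provider has filled in only the title and the author list so far.
header = SemanticHeader(
    title="CINDI: A Virtual Library Indexing and Discovery System",
    authors=["Bipin C. Desai", "Rajjan Shinghal", "Nader Shayan", "Youquan Zhou"],
)
print(header.missing_slots())  # ['subjects', 'abstract', 'location']
```

A registration aid could use a check like `missing_slots` to prompt the provider for whatever is still empty before the header is registered.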
For an efficient search, the index is stored in database registries distributed across the network. Since the document provider fills in her own semantic header, costly professional indexing is not required.

In this article, we describe an indexing and discovery system called CINDI (Concordia INdexing and DIscovery System), which helps a document provider fill in the semantic header for her document and register it on the net (see Figure 1). Once the header is registered, CINDI provides the facility for a searcher to locate the semantic header and then the document. CINDI thus allows a document to be searched not only on its syntax but also on its semantics. In this article, we use the term “provider” for one who makes a document available on the Internet; a “searcher” is one who looks for document(s); a “user” can be a provider or a searcher.

The article is organized as follows: a discussion of the knowledge discovery problem; an overview of CINDI; the registering and maintenance of the semantic header; the expert and database systems used; and the communication process. Owing to space limitations, the last two sections describe only briefly the searching and annotation features of CINDI. The current implementation status of CINDI and our future plans are given in the conclusion.

DISCOVERY ON THE INTERNET

In June 1995, we made a series of tests on a number of then existing Internet indexing systems; these were ALIWEB, DACLOD, EINet Galaxy, GNA Meta-Library, Harvest, InfoSeek, Lycos, Nikos, RBSE, World Wide Web Catalog, WebCrawler, WWW, and Yahoo. The intent of these tests was to determine how many URLs of documents containing the target search strings Bipin (AND) Desai were indexed by these systems. The results obtained are given in Table 1, which shows the number of hits, mis-hits, and misses (Desai, 1995a). In this table and the following tables, the number of hits is the count of the documents found to be relevant to the query.
The number of duplicates is the number of times the same document was retrieved by the indexing system using different components of the search criteria, or the number of times the same document was served from more than one site. More recent search engines tend to eliminate the former kind of duplicate; however, a document accessible from more than one site still appears more than once in the result. The number of mis-hits is the number of irrelevant documents, and the number of misses is the number of relevant documents missed by the search system.

Many of these pioneering indexing systems, existing in mid-1995, are no longer active. In the meantime, a number of new systems, such as Alta Vista, OpenText, Hotbot, and so on, have emerged. Many workers in the domain of the digital virtual library feel that these newer systems have addressed many of the issues we raised in designing the CINDI system.

Table 1. Search Statistics for Using the Search Term Bipin (AND) Desai: June 1995

Search System     Number of Hits  Number of Duplicates  Number of Mis-hits  Number of Items Missed
Aliweb            none                                                      25
DACLOD            none                                                      25
EINet             6               0                     4                   23
GNA Meta Lib.     none                                                      25
Harvest           none                                                      25
InfoSeek          7               0                     0                   18
Lycos             231             2                     222                 18
Nikos             none                                                      25
RBSE              8                                     8                   25
W3 Catalog        none                                                      25
Web Crawler       7               3                     0                   21
WWW               2               0                     0                   23
Yahoo             none                                                      25

The next series of tests was done from September through October 1997 to find the number of relevant documents that could be located by the then current search engines and to evaluate the usefulness of the index entries retrieved. The relevance of a document could be judged easily once the target set was known. We repeated the test performed in 1995 with the same search words. At the time of the test, some 325 URLs were known to contain the words “Bipin” and “Desai.” These represent Web documents pertaining to one of the authors of this article.
The complete list of these URLs can be retrieved from the following URL: http://~~~~~.cs.concordia.ca/~faculty/bcdesai/search-oct97/whereis-Desai.html. The first set of tests, the results of which are given in Table 2, was done on the following search engines: Alta Vista, Excite, Hotbot, InfoSeek, Lycos, OpenText, and Yahoo.

Table 2. Search Statistics for Using the Search Term Bipin (AND) Desai: Sept. 1997

Search System     Number of Hits  Number of Duplicates  Number of Mis-hits  Number of Defunct  Number of Items Missed
Alta Vista/Yahoo  97              9                     23                  4                  264
Excite            114             10                    29                  7                  247
InfoSeek          8               2                     1                   1                  319
Lycos             57              7                     15                  14                 297
Hotbot            245             28                    58                  19                 155
OpenText          19              7                     5                                      318

As in the 1995 series of tests, we have shown the results by noting the number of hits produced, the number of duplicates, the number of mis-hits, and the number of relevant documents not listed in the result; we have also included a column for the number of defunct URLs (which do not lead to any valid target Web pages). The duplicates are either the same document being served from two sites or the same document from the same site listed more than once. The latter errors seem to have been corrected in most search engines, which do sufficient pre-processing of the result to eliminate obvious duplicates before presenting it to users. The missed documents could be due to the approximations used by engines such as Alta Vista when they find a large number of hits. However, the fact that these search engines could not locate all documents indicates the difficulty of reaching isolated URLs by search robots.

The bigger problem is the lack of selectivity and of any measure of the usefulness of the documents found by the search engines. We have collated the results by following the trail of “next” sets of URLs, and these can be viewed by clicking on the number of hits for each search engine in the online version of Table 2 (Desai, 1997a).
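The counts in Table 2 can be turned into the standard retrieval measures of precision and recall. The sketch below is not a computation from the article; it assumes that every hit that is not a duplicate, a mis-hit, or a defunct URL is a distinct relevant document, and it uses the 325 URLs known at the time of the test as the full relevant set. The function name `retrieval_quality` is hypothetical.

```python
def retrieval_quality(hits, duplicates, mis_hits, defunct, known_relevant):
    """Return (precision, recall).

    Assumption: a hit that is neither a duplicate, a mis-hit, nor a defunct
    URL counts as one distinct relevant document.
    """
    relevant_found = hits - duplicates - mis_hits - defunct
    precision = relevant_found / hits if hits else 0.0
    recall = relevant_found / known_relevant if known_relevant else 0.0
    return precision, recall


# Alta Vista/Yahoo row of Table 2, against the 325 URLs known in 1997.
precision, recall = retrieval_quality(97, 9, 23, 4, 325)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.63 recall=0.19
```

For this row the assumption agrees with the table: 97 - 9 - 23 - 4 = 61 relevant documents found, and 325 - 61 = 264 matches the reported number of items missed. The other rows do not all reconcile in the same way, so the figures should be read as indicative only.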
A glance at the abstracts or summaries presented by the search engines indicates that they are not very revealing and, except for the most pedestrian needs, following the pointers would result in a drain on the searcher’s time.

SEARCH STATISTICS FOR USING VARIOUS SEARCH STRATEGIES

In a third series of tests, we used a simple search with the search terms Bipin Desai, and advanced searches with the expressions “Bipin Desai” and “Bipin C. Desai,” respectively. These tests were made only on Alta Vista/Yahoo. The results of these tests are given in Table 3.

Table 3. Search Statistics for Using the Various Search Terms: Sept. 1997

Columns, in order: Search System; Number of Hits; Number of Duplicates; Number of Mis-hits; Number of Defunct; Number of Items Missed
Alta Vista/Yahoo; 4285; 30-90%; 10.80%; 200+
Alta Vista/Yahoo; 29; 2; 13; 3; 312
Alta Vista/Yahoo; 128; 14; 10; 201

The result for a simple search of Bipin Desai (row 1 of Table 3) shows a high number of hits (4,285 in the test reported here; there is a bit of variation due to Alta Vista’s method of abandoning a search after a sufficiently large number of hits is made). However, the simple search produces very low selectivity and relevance. Most of the hits in the top 160 entries are irrelevant, and a large number of relevant documents are not located. Most searchers will not have the patience to go through more than a few pages of the result, there being some 214 pages of the result for 4,285 hits. The result for an advanced search expression for “Bipin Desai” (row 2 of Table 3) gives a lower number of hits and lower relevance since the author prefers to include his middle initial in his name. Most searchers may not be aware of such details. The result for an advanced search expression for “Bipin C. Desai” (row 3 of Table 3) gives a relatively large number of relevant documents, some of which are duplicates, being accessible from more than one site.
Some of the defunct URLs are not deleted by the search engines, pointing to the maintenance problem of the underlying database. However, this search still missed about two-thirds of the documents.

These tests lead us to believe that a search system should support better semantics. It is our opinion that the semantic header-based system (see Figure 2) (Desai, 1997b), wherein the provider of the resource is responsible for generating the entry, would be a more useful scheme to support discovery. The semantic header is designed to describe the semantic contents of the source information resource and is better suited to supporting knowledge discovery than the actual resource. Many formats of a resource may not be directly accessible electronically, may not be suitable for direct discovery, or may require a considerable amount of computation, resulting in extremely slow response. The semantic header could also be used as a surrogate to express semantic dependencies inherent in a collection, which is not possible with existing search engines.

The quality and the reliability of a document could be expressed by including reviewers’ comments in the form of annotations with the semantic header. Such reviews are rarely accessible in traditional cataloging systems. In the CINDI system, however, these annotations, along with the abstract supplied by the authors, would be valuable in judging the suitability of a document for a searcher. They could also give feedback to the provider. The semantic header metadata also allow the server system to perform initial query processing and thus reduce the cost involved in accessing and processing irrelevant resources.

OVERVIEW OF THE CINDI SYSTEM

The overall structure of the CINDI system is shown in Figure 1. The workstation at the provider’s site contains the CINDI client software and a partial catalog.
The client software is composed of a registering graphical user interface, the client portion of a distributed expert system, and the associated knowledge base. The semantic header information entered by the provider of a resource using this graphical interface is relayed from the user’s workstation by a client process to the database server process at one of the nodes of the SH Distributed Database (SHDDB). The node is chosen based on its proximity to the workstation or on the subject of the index record. On receipt of the information, the server verifies the correctness and authenticity of the information and, on finding everything in order, sends an acknowledgment to the client. It also has a partial catalog of the thesaurus database. The functions of these are described later in the section on the Semantic Header Registration System.

The server node is responsible for locating the partitions of the thesaurus for the subject hierarchy or the sites of the SHDDBs where the entry should be stored, and it forwards the replicated information to the appropriate nodes. The server node is also responsible for providing the catalog information for the search system. In this way, the various sites of the database work in cooperation to maintain the consistency of the replicated database. The replicated nature of the database also ensures distribution of load and continued access to the system when some sites are temporarily nonfunctional.

The user interface for the CINDI system consists of three graphical interfaces: the SH index registration system, the search system, and the
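The registration path described in this overview (pick an SHDDB node, send the header, receive an acknowledgment) can be sketched as follows. Everything here is an assumption made for illustration: the `Node` type, the numeric `distance` used as a proximity measure, the policy of preferring a node that holds the record's subject before falling back to the closest node, and the minimal check standing in for the server's verification of correctness and authenticity. The text says only that the node is chosen by proximity or by subject.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical SHDDB node description."""
    name: str
    distance: int        # assumed proximity measure; smaller means closer
    subjects: Set[str]   # subjects whose index partitions this node holds


def choose_node(nodes: List[Node], subject: Optional[str] = None) -> Node:
    """Prefer nodes holding the record's subject; otherwise use the closest node."""
    candidates = [n for n in nodes if subject in n.subjects] or nodes
    return min(candidates, key=lambda n: n.distance)


def register(header: dict, node: Node) -> str:
    """Stand-in for the server-side check: acknowledge only a plausible record."""
    if not header.get("title") or not header.get("provider"):
        return f"{node.name}: rejected"
    return f"{node.name}: acknowledged"


# Node names and subjects below are made up for the example.
nodes = [
    Node("node-a", distance=1, subjects={"computer science"}),
    Node("node-b", distance=5, subjects={"library science"}),
]
target = choose_node(nodes, subject="library science")
print(register({"title": "CINDI", "provider": "provider-1"}, target))  # node-b: acknowledged
```

Calling `choose_node(nodes)` without a subject returns the closest node, which corresponds to the proximity-based choice; replication of the accepted entry to other nodes is left out of the sketch.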


Similar resources

Using semantic templates for a natural language interface to the CINDI virtual library

In this paper, we present our work in building a template-based system for translating English sentences into SQL queries for a relational database system. The input sentences are syntactically parsed using the Link Parser, and semantically parsed through the use of domain-specific templates. The system is composed of a pre-processor and a run-time module. The pre-processor builds a conceptual ...


NLIDB Templates for Semantic Parsing

In this paper, we present our work in building a template-based system for translating English sentences into SQL queries for a relational database system. The input sentences are syntactically parsed using the Link Parser, and semantically parsed through the use of domain-specific templates. The system is composed of a pre-processor and a run-time module. The pre-processor builds a conceptual ...


Assessment of "drug-likeness" of a small library of natural products using chemoinformatics

Even though natural products have an excellent record as a source for new drugs, the advent of ultrahigh-throughput screening and large-scale combinatorial synthetic methods has caused a decline in the use of natural products research in the pharmaceutical industry. This is due to the efficiency in generating and screening a high number of synthetic combinatorial compounds; whereas traditional ...


Indexing and Searching Virtual Libraries

It is well known that selectivity leaves a lot to be desired in searching for information resources on the Internet with existing search systems [DESA4]. This has prompted a number of researchers to turn their attention to the development and implementation of models for indexing and searching information resources on the Internet. In this white paper we examine briefly the results of a simple q...


Journal title:
  • Library Trends

Volume 48, Issue 1

Pages 209-233

Publication date: 1999